
feat: add support for FastSAM model with point, box and text prompts #1120

Merged
msluszniak merged 18 commits into main from @bh/add-fast-sam on May 12, 2026

Conversation

@barhanc (Contributor) commented May 5, 2026

Description

Adds support for the FastSAM model, with the required postprocessing for point, box, and text prompts (text prompts use the already existing CLIP export). Also adds an example app to test these.

Since FastSAM uses a YOLO instance segmentation backbone with some clever postprocessing to imitate Facebook's SAM (see https://docs.ultralytics.com/models/fast-sam/#model-architecture), we reuse the existing instance segmentation C++ implementation and add TypeScript postprocessing to minimize code duplication.
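For context, the prompt handling is pure postprocessing over the instance masks, which is what makes it compose cleanly with the existing pipeline: a point prompt selects a mask containing the point, and a box prompt selects the mask whose bounding box best overlaps the drawn box. Here is a minimal TypeScript sketch of that idea (the types and selector names are illustrative assumptions, not the library's actual API):

```ts
// Illustrative shapes -- the real types live in react-native-executorch.
interface Bbox { x1: number; y1: number; x2: number; y2: number }
interface Instance { bbox: Bbox; mask: Uint8Array; maskWidth: number }

// Point prompt: pick an instance whose binary mask covers the tapped pixel.
// (A real implementation may break ties, e.g. by preferring the smallest mask.)
function selectByPoint(instances: Instance[], x: number, y: number): Instance | undefined {
  return instances.find(
    (inst) => inst.mask[Math.round(y) * inst.maskWidth + Math.round(x)] === 1
  );
}

// Box prompt: pick the instance whose bounding box has the highest IoU
// with the user-drawn box.
function selectByBox(instances: Instance[], box: Bbox): Instance | undefined {
  const area = (r: Bbox) => Math.max(0, r.x2 - r.x1) * Math.max(0, r.y2 - r.y1);
  const iou = (a: Bbox, b: Bbox) => {
    const iw = Math.max(0, Math.min(a.x2, b.x2) - Math.max(a.x1, b.x1));
    const ih = Math.max(0, Math.min(a.y2, b.y2) - Math.max(a.y1, b.y1));
    const inter = iw * ih;
    return inter / (area(a) + area(b) - inter || 1);
  };
  let best: Instance | undefined;
  let bestIou = 0;
  for (const inst of instances) {
    const score = iou(inst.bbox, box);
    if (score > bestIou) {
      bestIou = score;
      best = inst;
    }
  }
  return best;
}
```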

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

  • Run the Computer Vision - Segment Anything app screen and test the two available models there.
  • You can also run the Computer Vision - Instance Segmentation app screen and test the two newly added models there.
  • You can also run the Computer Vision - Vision Camera app screen to test the real-time performance of the two new models under 'Instance Segmentation'.
  • Check the Hugging Face page for the exported models: https://huggingface.co/software-mansion/react-native-executorch-fast-sam

Screenshots

You can use the following image for testing.

https://upload.wikimedia.org/wikipedia/commons/c/cd/Animal_diversity_October_2007.jpg

[Simulator screenshots: iPhone 17 Pro, 2026-05-06 at 23:12:00 and 23:15:22]

Related issues

Closes #555

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

@barhanc barhanc self-assigned this May 5, 2026
@barhanc barhanc added the feature (PRs that implement a new feature) and model (Issues related to exporting, improving, fixing ML models) labels May 5, 2026
@barhanc barhanc changed the title from "feat: add support for FastSAM model with point and box prompts" to "feat: add support for FastSAM model with point, box and text prompts" May 6, 2026
@barhanc barhanc marked this pull request as ready for review May 6, 2026 22:12
@barhanc barhanc requested review from chmjkb and msluszniak May 6, 2026 22:45
@msluszniak (Member) left a comment

Do we want to add some benchmarks for this one?

Comment thread packages/react-native-executorch/src/constants/modelUrls.ts Outdated
Comment thread packages/react-native-executorch/src/utils/segmentAnythingPrompts.ts Outdated
Comment thread apps/computer-vision/app/segment_anything/index.tsx
@chmjkb (Collaborator) left a comment

I tested the demo app on iOS and the results were pretty mid, at least for the S version. Not sure if this is the nature of the model, but just saying.

Comment thread apps/computer-vision/components/vision_camera/tasks/InstanceSegmentationTask.tsx Outdated
Comment thread docs/docs/03-hooks/02-computer-vision/segment-anything.md Outdated
Comment thread apps/computer-vision/app/segment_anything/index.tsx
Comment thread packages/react-native-executorch/src/utils/segmentAnythingPrompts.ts Outdated
@msluszniak (Member) commented

I tested the demo app on iOS and the results were pretty mid, at least for the S version. Not sure if this is the nature of the model, but just saying.

Probably the nature of the model. If you share the results you get, I can do a cross-check.

@barhanc barhanc force-pushed the @bh/add-fast-sam branch from 252b53e to decff98 May 8, 2026 10:22
@barhanc (Contributor, Author) commented May 8, 2026

I tested the demo app on iOS and the results were pretty mid, at least for the S version. Not sure if this is the nature of the model, but just saying.

From what I've tested, the S variant is fine for simple segmentation when objects don't overlap, but it's true that artifacts show up in more complex scenes. The X variant, however, worked fine on all images I tried, even ones with quite complex scenes. Did you observe bad performance on the X variant as well?

@chmjkb (Collaborator) commented May 8, 2026

I tested the demo app on iOS and the results were pretty mid, at least for the S version. Not sure if this is the nature of the model, but just saying.

From what I've tested, the S variant is fine for simple segmentation when objects don't overlap, but it's true that artifacts show up in more complex scenes. The X variant, however, worked fine on all images I tried, even ones with quite complex scenes. Did you observe bad performance on the X variant as well?

I gave the X version some more testing, and the point/box-based detections look cool, but the text-prompt-based ones were pretty bad. I almost never got the result I expected, but maybe it's due to the poor quality of the text embeddings. Quick example:

[Screenshot 2026-05-08 at 15:11:47]

@barhanc (Contributor, Author) commented May 11, 2026

@msluszniak @chmjkb I've:

  • updated the docs by adding a section on selector use to useInstanceSegmentation.md; the previous file didn't really fit under hooks/ in my opinion.
  • moved the helper function bboxArea to utils/commonVision.ts as requested.
  • fixed keyboard handling - hopefully it feels better now.
  • fixed a problem with the way cropped images were passed to CLIP for embeddings. Previously I followed exactly how it's done in the ultralytics Python implementation, where the cropped image is simply passed without masking; now the cropped image is masked based on the segmentation mask, so it should work better on examples like the one posted above. There are still some problems with text prompts on some images, but those are due to the CLIP model not being able to properly embed images with certain parts masked out.
  • added an optional topk parameter to selectByText that lets the user get the top-k instances best matching the given text prompt (a rough sketch follows below). This is mostly for convenience and doesn't truly solve the problem of returning multiple instances, since topk must be passed explicitly; e.g. it doesn't cover use cases where the user would like to automatically count the objects matching a certain prompt. The problem is that text/image embeddings are inherently contrastive, so a similarity threshold doesn't really solve it either. I could add a function that does something like open-vocabulary classification, where the user inputs text prompts that correspond to classes and for each segmented instance we return the best-matching class based on cosine similarity, but I'm not sure that's needed. What do you think?

I will also be adding benchmarks shortly.
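For reference, below is a rough sketch of how selectByText with the topk parameter could work under the approach described above: mask each instance's crop, embed it with the CLIP image encoder, and rank instances by cosine similarity to the text embedding. The signature and the encoder callbacks are assumptions for illustration, not the library's actual API:

```ts
// Illustrative sketch -- assumes CLIP encoders that return L2-normalized
// embeddings, so the dot product equals the cosine similarity.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}

// Generic over the instance type; `embedImage` is assumed to crop the
// instance, zero out pixels outside its segmentation mask (the fix
// described above), and run the CLIP image encoder on the result.
async function selectByText<T>(
  instances: T[],
  prompt: string,
  embedImage: (instance: T) => Promise<Float32Array>,
  embedText: (text: string) => Promise<Float32Array>,
  topk: number = 1
): Promise<T[]> {
  const textEmb = await embedText(prompt);
  const scored = await Promise.all(
    instances.map(async (inst) => ({
      inst,
      score: cosineSimilarity(await embedImage(inst), textEmb),
    }))
  );
  // topk must be passed explicitly: contrastive embeddings give relative
  // rankings, not a natural similarity threshold to cut at.
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topk)
    .map((s) => s.inst);
}
```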

@barhanc barhanc requested review from chmjkb and msluszniak May 11, 2026 10:36
@msluszniak (Member) left a comment

LGTM from my side. We could add a tip to the documentation that, for images with overlapping entities, FastSAM-X is preferred over the smaller version.

@chmjkb (Collaborator) commented May 12, 2026

  • added an optional topk parameter to selectByText that lets the user get the top-k instances best matching the given text prompt. This is mostly for convenience and doesn't truly solve the problem of returning multiple instances, since topk must be passed explicitly; e.g. it doesn't cover use cases where the user would like to automatically count the objects matching a certain prompt. The problem is that text/image embeddings are inherently contrastive, so a similarity threshold doesn't really solve it either. I could add a function that does something like open-vocabulary classification, where the user inputs text prompts that correspond to classes and for each segmented instance we return the best-matching class based on cosine similarity, but I'm not sure that's needed. What do you think?


I think the topk should be enough; no need to over-engineer this. I'll review it now, and if it looks OK then :shipit:

@msluszniak msluszniak merged commit c0e3a83 into main May 12, 2026
5 checks passed
@msluszniak msluszniak deleted the @bh/add-fast-sam branch May 12, 2026 13:37

Labels

feature (PRs that implement a new feature), model (Issues related to exporting, improving, fixing ML models)

Development

Successfully merging this pull request may close these issues.

SAM - segment anything model
